NLP 04L - Visualizing Vector Arithmetic Lab (Python)

Visualizing Vector Arithmetic Lab

In this lab you:

  • Apply and visualize basic vector arithmetic to embeddings
  • Calculate cosine similarity between vectors
%pip install gensim==3.7
Python interpreter will be restarted. ... Successfully installed gensim-3.7.0 smart-open-3.0.0 ... Python interpreter will be restarted.
%run ../Includes/Classroom-Setup

We need the gensim library to load the pretrained GloVe vectors again.

import gensim.downloader as api
 
# Load the 100-dimensional GloVe embeddings pretrained on Wikipedia and Gigaword
word_vectors = api.load("glove-wiki-gigaword-100")

We are going to recreate the DataFrame and graph showcasing the embeddings of the words "man", "woman", "king", and "queen." The word "object" is also included when fitting the PCA, but only the first four vectors are plotted.

import matplotlib.pyplot as plt
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
 
def get_embed_df(words):
  # Create df
  vecs = [Vectors.dense([val.item() for val in word_vectors[word]]) for word in words]
  df_list = [(i, word, vec) for i, (word, vec) in enumerate(zip(words, vecs))]
  df = spark.createDataFrame(df_list, ["id", "word", "vectors"])
  
  # Reduce to 2 dim for plotting
  pca = PCA(k=2, inputCol="vectors", outputCol="2d_vectors")
  model = pca.fit(df)
  return model.transform(df)
 
words = ["man", "woman", "king", "queen", "object"] 
embed_df = get_embed_df(words)
 
# Get the 4 vectors we are interested in
vectors = [row[0] for row in embed_df.select("2d_vectors").collect()[:4]]
 
# Plot
def plot_vectors(vectors, words, title="Visualizing Word Embeddings", xlim1=-2, xlim2=5, ylim1=-.5, ylim2=4.5):
  for coord, word in zip(vectors, words):
    plt.quiver(0, 0, coord[0], coord[1], angles="xy", scale_units="xy", scale=1)
    plt.text(coord[0]+0.05, coord[1], word)
  
  plt.title(title)
  plt.xlim(xlim1, xlim2)
  plt.ylim(ylim1, ylim2)
  
plot_vectors(vectors, words)
display(plt.show())
plt.gcf().clear()

Now we want to visualize the vector arithmetic behind the word_vectors.most_similar(positive=['woman', 'king'], negative=['man']) call. This call returns the word whose embedding is closest to the vector obtained by adding the embeddings in positive and subtracting the embeddings in negative.

To visualize what the resulting vector should be, we will continue working with the reduced 2D representations of the pretrained GloVe embeddings: first we add the positive embeddings woman and king, then we subtract the embedding of man.

Fill in the following 2 lines to get

  1. the intermediate woman + king embedding
  2. the final woman + king - man embedding

and run the cell to plot the original 4 vectors with the 2 newly calculated vectors.

# TODO
 
# Plot the 4 vectors from the graph above
plot_vectors(vectors, words, "Visualizing 'woman+king-man'")
 
# Woman + king
w_plus_k = vectors[words.index("woman")] + vectors[words.index("king")]
# Woman + king - man
w_plus_k_minus_m = w_plus_k - vectors[words.index("man")]
 
# Format the new vectors and labels in the graph
# w_plus_k
plt.quiver(0, 0, w_plus_k[0], w_plus_k[1], angles="xy", scale_units="xy", scale=1, color="blue")
plt.text(w_plus_k[0]+0.1, w_plus_k[1], "woman+king")
# w_plus_k_minus_m: draw the subtraction arrow from the tip of woman+king
plt.quiver(w_plus_k[0], w_plus_k[1], w_plus_k_minus_m[0]-w_plus_k[0], w_plus_k_minus_m[1]-w_plus_k[1], 
           angles="xy", scale_units="xy", scale=1, color="red")
plt.text(w_plus_k_minus_m[0]+0.1, w_plus_k_minus_m[1]+0.05, "(woman+king)-man")
 
display(plt.show())
plt.gcf().clear()

Even though we reduced the embeddings down to 2 dimensions, which word does the resulting embedding still lie closest to? Does this match the answer that word_vectors.most_similar(positive=['woman', 'king'], negative=['man']) returns?
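
You can check directly with gensim's most_similar (a quick sketch; exact scores vary by gensim version, but "queen" should top the list):

# Words closest to woman + king - man; topn limits how many neighbors are returned
word_vectors.most_similar(positive=["woman", "king"], negative=["man"], topn=3)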

Note: Vector addition is commutative and associative (and subtracting a vector is just adding its negation), so you can change the order of the operations and see that the resulting woman+king-man vector still approximates the queen embedding.
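
For example, a minimal check with the 2D vectors from the cells above (this assumes vectors, words, and w_plus_k_minus_m are still defined):

import numpy as np
 
# Reorder the operations: (king - man) + woman instead of (woman + king) - man
reordered = (vectors[words.index("king")] - vectors[words.index("man")]) + vectors[words.index("woman")]
 
# Both orderings produce numerically identical vectors
print(np.allclose(w_plus_k_minus_m.toArray(), reordered.toArray()))  # True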

Now we are going to explore what exactly "similar" embeddings mean.

Recall the definition of cosine similarity:

$$\text{CosineSimilarity}(u, v) = \cos(\text{angle between vectors}) = \frac{u \cdot v}{|u| |v|}$$
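
For intuition, a small worked example with u = (1, 0) and v = (1, 1):

$$\text{CosineSimilarity}(u, v) = \frac{1 \cdot 1 + 0 \cdot 1}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707$$

which is the cosine of the 45° angle between the two vectors. Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1.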

Run the following cell to see what the gensim default function returns for the similarity between queen and king.

word_vectors.similarity("queen", "king")
Out[9]: 0.7507691

Now fill in the following cos_similarity function to calculate the cosine similarity between two vectors (NumPy arrays) of equal dimension.

Hint: Take a look at some numpy functions.

# TODO
import numpy as np
 
def cos_similarity(v1, v2):
  return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
 
word1 = "queen"
word2 = "king"
ans = cos_similarity(word_vectors[word1], word_vectors[word2])
print(ans)
 
assert round(ans, 3) == round(word_vectors.similarity(word1, word2), 3), "Your answer does not match Gensim's"
0.750769
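
As a final check (a sketch using the word_vectors and cos_similarity defined above), we can build the woman + king - man vector in the full 100-dimensional space and compare it directly to the queen embedding:

# Analogy vector in the original 100-dimensional GloVe space
analogy_vec = word_vectors["woman"] + word_vectors["king"] - word_vectors["man"]
 
# Cosine similarity between the analogy vector and the actual "queen" embedding
print(cos_similarity(analogy_vec, word_vectors["queen"]))

Note that gensim's most_similar unit-normalizes each word vector before combining them, so its reported scores will differ slightly from this raw-arithmetic version.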